Kristian K Ullrich
Max Planck Institute for Evolutionary Biology
A matter of distance.
Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.
Typically used to:
An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.
MSA2dist Features:
From Bioconductor:
From GitHub:
Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.
Source: Pico-PLAZA 3.0
Where to find: the data/ directory in the GitHub repo associated with this presentation.
data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
codingsequences.rda: DNAStringSet object.clusters.rda: Pre-calculated syntenet clusters.codingsequences# Load necessary libraries
library(MSA2dist)
library(Biostrings)
library(dplyr)
library(stringr)
# Load data set codingsequences
load(here::here("data", "codingsequences.rda"))
# Inspect data
head(names(codingsequences))[1] "AP00G00010" "AP00G00020" "AP00G00030" "AP00G00040" "AP00G00050"
[6] "AP00G00060"
clusters Gene Cluster
1 Chlo_CNC64A_028G00030 1
2 Chlo_CNC64A_028G00040 2
3 Chlo_CNC64A_028G00070 3
4 Chlo_CNC64A_028G00080 4
5 Chlo_CNC64A_028G00110 5
6 Chlo_CNC64A_028G00140 6
[1] 22605
1 2 3 4 5 6
38 38 37 2 9 4
# Get one random cluster of size 10
my_cluster <- clusters %>% dplyr::filter(
Cluster==sample(which(table(clusters$Cluster)==10), 1))
head(my_cluster) Gene Cluster
1 Bpra_BPRRCC1105_14G02260 15527
2 Bpra_BPRRCC1105_14G02990 15527
3 Osp_ORCC809_02G02140 15527
4 Osp_ORCC809_02G01300 15527
5 Oluc_OL02G04530 15527
6 Oluc_OL02G02820 15527
# Remove species unique identifier and add as GeneID
my_cluster$GeneID <- stringr::str_split_fixed(my_cluster$Gene, "_", 2)[, 2]
my_cluster Gene Cluster GeneID
1 Bpra_BPRRCC1105_14G02260 15527 BPRRCC1105_14G02260
2 Bpra_BPRRCC1105_14G02990 15527 BPRRCC1105_14G02990
3 Osp_ORCC809_02G02140 15527 ORCC809_02G02140
4 Osp_ORCC809_02G01300 15527 ORCC809_02G01300
5 Oluc_OL02G04530 15527 OL02G04530
6 Oluc_OL02G02820 15527 OL02G02820
7 Omed_OM_04G02150 15527 OM_04G02150
8 Omed_OM_04G01300 15527 OM_04G01300
9 Otau_OT_02G00670 15527 OT_02G00670
10 Otau_OT_02G01820 15527 OT_02G01820
# Fetch coding sequences
my_cds <- codingsequences[match(my_cluster$GeneID,
names(codingsequences))]
my_cdsDNAStringSet object of length 10:
width seq names
[1] 2151 ATGAGCATAGAAACCAAGCAACG...CGATATCATATATTCGGTGTGA BPRRCC1105_14G02260
[2] 2130 ATGGGCGTACCCGATCCCGATTT...TAAGAATATAACTTTTGTTTAG BPRRCC1105_14G02990
[3] 2124 ATGTCCACCGATGAGTTGGACAT...TCACCAAATCCATAGAATTTAG ORCC809_02G02140
[4] 1950 ATGCTCTATGGTGATATTCTACG...TAGTATTACGTTCGTTAGCTAG ORCC809_02G01300
[5] 2154 ATGGCCGATGAAGCATACACGCG...CAACATAACGTTCGTAGACTAG OL02G04530
[6] 2124 ATGGAACGCGAGTTCGAAGCCTT...TGGAATTGTGCACATCATTTGA OL02G02820
[7] 2139 ATGTTTGAACCAGATTATCGAAG...AAACATCACGTTCGTAGAATAG OM_04G02150
[8] 2133 ATGAACAACGACGGCGTCGACGG...GGGTCACATCCACTGTCTCTAG OM_04G01300
[9] 2124 ATGGAGCGCGAGTTCGTGAATTT...AGGCACGGTGCACATTATTTAG OT_02G00670
[10] 2133 ATGAGTGAAGAGGTGTACGCTCA...TAATATTACATTCATAGAGTAA OT_02G01820
To calculate a coding sequence alignment for two sequences:
# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_alnDNAStringSet object of length 2:
width seq names
[1] 2367 ATGAGCATAGAAACCAAGCAACG...ATATC---ATATATTCGGTGTGA BPRRCC1105_14G02260
[2] 2367 ATGGGCGTACCCGATCCC-----...ATAAGAATATAACTTTTGTTTAG BPRRCC1105_14G02990
To calculate dN/dS from this alignment:
To calculate all pairwise combinations:
start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()
head(my_cds_kaks) Comp1 Comp2 seq1 seq2 Method Ka
result.1 1 2 BPRRCC1105_14G02260 BPRRCC1105_14G02990 YN 0.68691
result.2 1 3 BPRRCC1105_14G02260 ORCC809_02G02140 YN 0.385174
result.3 1 4 BPRRCC1105_14G02260 ORCC809_02G01300 YN 0.714543
result.4 1 5 BPRRCC1105_14G02260 OL02G04530 YN 0.734026
result.5 1 6 BPRRCC1105_14G02260 OL02G02820 YN 0.370458
result.6 1 7 BPRRCC1105_14G02260 OM_04G02150 YN 0.670571
Ks Ka/Ks P-Value(Fisher) Length S-Sites N-Sites
result.1 4.48181 0.153266 2.47864e-58 1911 393.764 1517.24
result.2 4.64889 0.0828529 6.04249e-105 2052 492.021 1559.98
result.3 4.54757 0.157126 7.84145e-49 1887 429.845 1457.15
result.4 4.5871 0.16002 1.32653e-43 1941 453.107 1487.89
result.5 4.59787 0.0805717 1.22049e-143 2010 459.662 1550.34
result.6 4.56461 0.146907 3.77102e-72 1896 439.721 1456.28
Fold-Sites(0:2:4) Substitutions S-Substitutions N-Substitutions
result.1 NA 1026 345.973 680.027
result.2 NA 885 415.871 469.129
result.3 NA 1032 362.557 669.443
result.4 NA 1067 372.541 694.459
result.5 NA 880 427.41 452.59
result.6 NA 1039 395.111 643.889
Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1 NA NA
result.2 NA NA
result.3 NA NA
result.4 NA NA
result.5 NA NA
result.6 NA NA
Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1 1.46886 1.26809:1.26809:1:1:1:1
result.2 1.40751 1.5774:1.5774:1:1:1:1
result.3 1.58768 1.29662:1.29662:1:1:1:1
result.4 1.63349 1.2953:1.2953:1:1:1:1
result.5 1.33721 1.3531:1.3531:1:1:1:1
result.6 1.57368 1.32171:1.32171:1:1:1:1
GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.360794(0.455006:0.333333:0.294043) NA NA NA NA
result.2 0.408784(0.501351:0.372297:0.352703) NA NA NA NA
result.3 0.39507(0.489145:0.358209:0.337856) NA NA NA NA
result.4 0.396866(0.481576:0.361499:0.347522) NA NA NA NA
result.5 0.402299(0.48939:0.375332:0.342175) NA NA NA NA
result.6 0.380594(0.465496:0.343162:0.333124) NA NA NA NA
How long did it take?
Here, a pre-calculated MSA is necessary:
Plot average behavior of each codon:
Kristian K Ullrich @kullrich